Volve G&G Dataset¶

The "Volve" oil field is an offshore oil and gas field located in the North Sea, approximately 190 kilometers west of Stavanger, Norway. It was discovered in 1993 and production began in 2008. The field is owned and operated by Equinor (formerly known as Statoil), which is one of the largest energy companies in the world. The field has estimated recoverable reserves of around 190 million barrels of oil equivalent, and produces mainly crude oil. Volve is a relatively small field compared to some of the other offshore fields in the North Sea, but it is still an important asset for Equinor and for Norway's overall oil and gas industry.

In [1]:
from IPython.display import Image
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
import seaborn as sns
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import PowerTransformer
from sklearn.ensemble import IsolationForest
from sklearn.cluster import KMeans
from sklearn import metrics
from random import seed

RANDOM_STATE = 417564

seed(RANDOM_STATE)

In the markdown cells of Jupyter notebooks we can easily display images. There is no need to import a dedicated Python package solely for that purpose :D we can just use some trivial HTML

with auto dimensions

![alt](https://www.norskpetroleum.no/factpages/3420717.jpg)


with fixed dimensions

<img src="https://www.norskpetroleum.no/factpages/3420717.jpg" alt="drawing" width="700"/>
In [2]:
Image(url="https://www.norskpetroleum.no/factpages/3420717.jpg", width=700)
Out[2]:

Which well logs will we use?¶

  • NPHI (neutron porosity): It is used to estimate the amount of fluids (usually hydrocarbons) contained in the rock formation. The tool bombards the formation with neutrons and measures how strongly they are slowed and scattered back to the detector, which depends mainly on the hydrogen content of the pore fluids. This is important in the oil and gas industry for assessing the potential productivity of a reservoir.

  • RHOB (bulk density): It is used to measure the mass of the rock formation per unit volume. This is important for calculating the overall density of the rock formation and understanding its mechanical properties.

  • GR (gamma ray): It is used to measure the amount of natural radiation that is emitted by the rock formation. This information can be used to identify certain types of rock formations, such as shale, and to estimate the amount of organic matter present in the formation.

  • RT (resistivity): It is used to measure the electrical resistance of the rock formation to the flow of electric current. This information can be used to determine the presence and quality of fluids within the rock formation.

  • PEF (photoelectric factor): It is used to measure how strongly the rock formation absorbs gamma rays via the photoelectric effect. Because this absorption depends on the average atomic number of the minerals, PEF is a useful lithology indicator, helping to distinguish rock types such as sandstone, limestone, and dolomite.

  • CALI (caliper): It is used to measure the diameter of the borehole. This information is important for determining the correct size of tools to be used for further measurements and for ensuring the stability of the borehole.

  • DT (compressional travel time): It is used to measure the time it takes for a compressional (P-wave) sound wave to travel a known distance through the rock formation. This information can be used to infer the formation's mechanical and petrophysical properties, such as its porosity and elastic moduli.

Step 1 - importing data from an Excel file and plotting the logs¶

Task 1¶

Read the Excel file into a pandas dataframe and save it to the variable df using the function pd.read_excel

example: variable = pd.read_excel('file_name.xlsx')

Print df - what is the index of your data?

In [50]:
df = pd.read_excel('./well_subset.xlsx')

print(df.head())

print("\nData index:")
print(df.index)
    DEPTH    NPHI    RHOB      GR      RT     PEF    CALI       DT
0  2800.0  0.1425  2.4629  3.2562  1.7704  8.0126  8.5782  68.2803
1  2800.1  0.1416  2.4583  3.2575  1.7734  8.0124  8.6250  68.4759
2  2800.2  0.1436  2.4548  2.8439  1.8059  8.0316  8.6250  68.6693
3  2800.3  0.1454  2.4504  2.4479  1.8467  8.0325  8.6249  68.7748
4  2800.4  0.1509  2.4438  3.0292  1.9006  7.9983  8.5781  68.8805

Data index:
RangeIndex(start=0, stop=8001, step=1)

Task 2¶

Set column DEPTH as the index using the following syntax

variable = variable.set_index('column_name')

Print the first 10 rows using the syntax

variable.head(number of rows)

In [51]:
df = df.set_index('DEPTH')

print(df.head(10))
          NPHI    RHOB      GR      RT     PEF    CALI       DT
DEPTH                                                          
2800.0  0.1425  2.4629  3.2562  1.7704  8.0126  8.5782  68.2803
2800.1  0.1416  2.4583  3.2575  1.7734  8.0124  8.6250  68.4759
2800.2  0.1436  2.4548  2.8439  1.8059  8.0316  8.6250  68.6693
2800.3  0.1454  2.4504  2.4479  1.8467  8.0325  8.6249  68.7748
2800.4  0.1509  2.4438  3.0292  1.9006  7.9983  8.5781  68.8805
2800.5  0.1549  2.4343  2.9266  1.9117  7.9443  8.5782  68.9858
2800.6  0.1573  2.4217  3.4017  1.8806  7.9051  8.6250  69.0042
2800.7  0.1632  2.4096  3.7842  1.8404  7.9249  8.6250  69.0204
2800.8  0.1679  2.4020  3.1949  1.8093  7.9677  8.6250  69.0371
2800.9  0.1703  2.4000  2.6821  1.7874  7.9796  8.6250  69.2040

Task 3 Import the library for plotting (or install it if needed). Specify the list of colors and plot the logs.¶

In [52]:
colors = ['blue', 'green', 'red', 'purple', 'orange', 'brown', 'teal']


# We are going to reuse that utility function later on
def plot_logs(df):
    fig, axs = plt.subplots(ncols=len(df.columns), figsize=(15, 10), gridspec_kw=dict(wspace=0.9))

    for i, col in enumerate(df.columns):
        axs[i].plot(df.iloc[:, i], df.index, color=colors[i], linewidth=0.5)
        axs[i].set_xlabel(col)
        axs[i].xaxis.label.set_color(colors[i])
        axs[i].set_xlim(df.iloc[:, i].min(), df.iloc[:, i].max())
        axs[i].set_ylabel("Depth [m]")  # Y-axis label including the unit
        axs[i].tick_params(axis='x', colors=colors[i])
        axs[i].spines["top"].set_edgecolor(colors[i])
        axs[i].title.set_color(colors[i])
        axs[i].set_xticks([df.iloc[:, i].min(), df.iloc[:, i].max()])

    plt.gca().invert_yaxis()
    plt.show()


plot_logs(df)

Task 4 Do you notice anything strange? What about the RT curve? Write your comment¶

In [57]:
df['DT'].describe()
Out[57]:
count    8001.00000
mean       73.68914
std        11.59884
min        56.38230
25%        66.33490
50%        71.62930
75%        76.51870
max       116.23160
Name: DT, dtype: float64
In [56]:
df['NPHI'].describe()
Out[56]:
count    8001.000000
mean        0.151355
std         0.100839
min         0.029000
25%         0.090100
50%         0.127000
75%         0.167600
max         0.593200
Name: NPHI, dtype: float64
In [58]:
df['RT'].describe()
Out[58]:
count     8001.000000
mean        11.529283
std        540.312611
min          0.142500
25%          2.387500
50%          3.296700
75%          4.837000
max      46224.449200
Name: RT, dtype: float64

The RT values range from 0.14 to over 46,000, spanning more than five orders of magnitude (a factor of more than 10^5).

This extremely wide range makes it difficult to visualize on a linear scale, and may result in the curve appearing almost flat in many zones, hiding valuable details.

Additionally, the standard deviation (540.31) is extremely high compared to the mean (11.53), suggesting the presence of strong outliers or a highly skewed distribution.

Task 5 Check for outliers - what is an outlier?¶

In [6]:
# We are going to reuse that utility function later on
def plot_boxplots(df):
    fig, axs = plt.subplots(ncols=len(df.columns), figsize=(20, 5), gridspec_kw=dict(wspace=0.5))

    for i, col in enumerate(df.columns):
        axs[i].boxplot(df[col])
        axs[i].set_xlabel(col)
        axs[i].set_ylabel("Value")
        axs[i].tick_params(axis='x', colors=colors[i])
        axs[i].spines["top"].set_edgecolor(colors[i])
        axs[i].title.set_color(colors[i])


plot_boxplots(df)

We need to remove the weirdly high values of RT, but what should we do with these observations? Can we replace them using interpolation from the nearest samples?

TL;DR: Yes, we can


We can replace these weirdly high values of RT with interpolated values because such extreme outliers are unlikely to represent realistic geological measurements; they may result from sensor errors or data transmission issues.

Interpolation from neighboring samples assumes that resistivity changes gradually with depth, so using nearby valid data helps restore a smooth, realistic curve without introducing artificial spikes or gaps.

Task 6 Correct RT curve¶

In [7]:
old_df = df.copy()

df.loc[df['RT'] > 100, 'RT'] = np.nan

df['RT'] = df['RT'].interpolate(method='nearest')

print(df)
          NPHI    RHOB       GR      RT     PEF    CALI       DT
DEPTH                                                           
2800.0  0.1425  2.4629   3.2562  1.7704  8.0126  8.5782  68.2803
2800.1  0.1416  2.4583   3.2575  1.7734  8.0124  8.6250  68.4759
2800.2  0.1436  2.4548   2.8439  1.8059  8.0316  8.6250  68.6693
2800.3  0.1454  2.4504   2.4479  1.8467  8.0325  8.6249  68.7748
2800.4  0.1509  2.4438   3.0292  1.9006  7.9983  8.5781  68.8805
...        ...     ...      ...     ...     ...     ...      ...
3599.6  0.1289  2.5771  44.3674  2.3147  6.1787  8.5781  70.1850
3599.7  0.1259  2.5490  43.5794  2.3004  5.9839  8.5781  70.3162
3599.8  0.1312  2.5246  44.6774  2.2336  5.8875  8.5781  70.5137
3599.9  0.1365  2.5003  45.4844  2.1827  5.7913  8.5781  70.7711
3600.0  0.1470  2.4950  47.8596  2.1170  5.7226  8.5784  71.3462

[8001 rows x 7 columns]

See if it helped! We will visualize the data again¶

In [8]:
plot_logs(df)
In [9]:
plot_boxplots(df)

Task 7 Check the outlier visualization using the seaborn library¶

Check the seaborn website: https://seaborn.pydata.org/examples/index.html and note in a comment three visualizations that you could use in your daily tasks

In [10]:
for i, col in enumerate(df.columns):
    fig, ax = plt.subplots(figsize=(4, 4))
    sns.boxplot(y=df[col], color=colors[i], orient='v')

After applying interpolation using the nearest neighbors, the RT curve looks much smoother and more continuous compared to the original.

The extreme high values (up to ~46000) were successfully removed, and the missing or anomalous data points were filled in using values from adjacent depths. This resulted in a realistic and geologically plausible well-log that preserves the general shape of the curve without abrupt spikes or gaps.

Task 8 - Remove outliers that are more than 3 standard deviations from the mean in a rolling window of 100 samples¶

In [11]:
window_size = 100

z_scores = (df - df.rolling(window_size).mean()) / df.rolling(window_size).std()

df_noout = df[(np.abs(z_scores) < 3).all(axis=1)]

print(df_noout)
          NPHI    RHOB       GR      RT     PEF    CALI       DT
DEPTH                                                           
2809.9  0.1427  2.4589   4.2370  2.0991  7.9082  8.6249  70.0120
2810.0  0.1402  2.4615   4.1468  2.0325  7.9031  8.5782  69.9705
2810.1  0.1393  2.4696   4.2389  1.9359  7.9372  8.6015  69.8849
2810.2  0.1348  2.4811   3.6711  1.8266  7.9944  8.5781  69.7992
2810.3  0.1309  2.4924   2.8638  1.7334  8.0531  8.5781  69.7205
...        ...     ...      ...     ...     ...     ...      ...
3599.0  0.1207  2.4931  41.2866  1.4664  5.8496  8.7030  71.5955
3599.1  0.1200  2.5050  42.4855  1.6046  6.0435  8.6718  71.1964
3599.2  0.1240  2.5236  42.9226  1.7066  6.2858  8.6252  70.8162
3599.9  0.1365  2.5003  45.4844  2.1827  5.7913  8.5781  70.7711
3600.0  0.1470  2.4950  47.8596  2.1170  5.7226  8.5784  71.3462

[6618 rows x 7 columns]
In [12]:
# make a for loop for visualization using df dataset
plot_logs(df)

# make a for loop for visualization using df_noout dataset
plot_logs(df_noout)

Step 2 - Exploratory Data Analysis¶

EDA (Exploratory Data Analysis) is the process of analyzing and visualizing data in order to extract insights, patterns, and trends. It is typically one of the first steps in data analysis and is used to gain a better understanding of the data and its characteristics. EDA can help identify outliers, missing values, and any other issues with the data that may need to be addressed before further analysis. It can also help in selecting appropriate statistical methods and models for data analysis. EDA involves using a range of techniques such as histograms, scatter plots, box plots, and correlation matrices to explore the data visually and identify relationships between variables. EDA is an important part of data science and plays a crucial role in the data analysis process.

Histograms¶

Histograms are graphical representations of the distribution of data. They display the frequency distribution of a variable by creating a set of contiguous and non-overlapping intervals (or bins) along the range of the variable and then plotting the count or proportion of observations that fall within each bin. By examining a histogram, one can see the central tendency, variability, and shape of the distribution of the data. Additionally, it can also help identify any outliers or unusual patterns in the data. Overall, histograms are useful for understanding the distribution of a variable and gaining insight into the underlying patterns and characteristics of the data.

Optimal bin size

The Freedman-Diaconis rule is a method for determining the bin width of a histogram in statistical data analysis. The rule sets the bin width to h = 2 · IQR / n^(1/3), where IQR is the interquartile range of the data and n is the total number of samples. The bin width is important because it determines the smoothness of the histogram and can affect the interpretation of the data. The Freedman-Diaconis rule aims to create a histogram with sufficient detail to reveal the underlying distribution of the data while avoiding oversmoothing or undersmoothing. It is considered a robust method for determining bin width because it is less sensitive to outliers than other methods such as Sturges' rule or Scott's rule. The Freedman-Diaconis rule has been widely adopted in various fields, including economics, environmental science, and medical research.

In [13]:
fig, axs = plt.subplots(ncols=len(df_noout.columns), figsize=(20, 5), gridspec_kw=dict(wspace=0.5))

for i, col in enumerate(df_noout.columns):
    q75, q25 = np.percentile(df_noout[col].dropna(), [75, 25])

    iqr = q75 - q25

    n = len(df_noout[col].dropna())
    h = 2 * iqr / (n ** (1 / 3)) if n > 0 else 1

    bins = int((df_noout[col].max() - df_noout[col].min()) / h) if h > 0 else 10
    bins = max(bins, 1)

    axs[i].hist(df_noout[col].dropna(), bins=bins, color=colors[i])
    axs[i].set_xlabel(col)
    axs[i].set_ylabel("N")
    axs[i].tick_params(axis='x', colors=colors[i])
    axs[i].spines["top"].set_edgecolor(colors[i])
    axs[i].title.set_color(colors[i])

Pairplot¶

A pairplot is a graphical tool used in data analysis and visualization that creates pairwise scatterplots and histograms for a given dataset. It is a useful method for exploring the relationship between multiple variables in a dataset. Each scatterplot in the pairplot shows the relationship between two variables in the dataset, while the histograms show the distribution of each variable individually.

By examining the pairplot, we can gain insights into the relationships between variables in the dataset. We can identify variables that are strongly correlated, positively or negatively, as well as variables that are not correlated at all. We can also see the distribution of each variable, including whether they are normally distributed or skewed, and identify any outliers. Pairplots can be useful for identifying potential patterns or trends in the data, as well as for detecting any issues with the data, such as missing or erroneous values. Overall, pairplots are a useful tool for exploratory data analysis and for gaining a better understanding of the relationships between variables in a dataset.

In [14]:
def plot_pairplot(data):
    cols = ['NPHI', 'RHOB', 'GR', 'RT', 'PEF', 'CALI', 'DT']

    hist_color = 'lightgreen'
    scatter_color = 'lavender'

    sns.set_palette("pastel")

    sns.pairplot(data, vars=cols, diag_kind='kde',
                 plot_kws={'alpha': 0.6, 's': 20, 'edgecolor': scatter_color},
                 diag_kws={'color': hist_color})


plot_pairplot(df_noout)


Correlation matrix¶

Correlation matrix is a table that displays the correlation coefficients between different variables in a dataset. It is commonly used in statistics and data analysis to identify patterns and relationships between variables.

The correlation coefficient is a statistical measure that ranges from -1 to 1, indicating the strength and direction of the linear relationship between two variables. A value of 1 indicates a perfect positive correlation, meaning that when one variable increases, the other variable increases proportionally. A value of -1 indicates a perfect negative correlation, meaning that when one variable increases, the other variable decreases proportionally. A value of 0 indicates no correlation, meaning that there is no relationship between the variables.

The correlation matrix is a square matrix where the diagonal contains the correlation coefficient between each variable and itself, which is always 1. The upper and lower triangles of the matrix contain the correlation coefficients between each pair of variables, with duplicates reflected across the diagonal. A correlation matrix can be visualized as a heat map, where the color of each cell represents the magnitude of the correlation coefficient. Correlation matrices are commonly used in data analysis, machine learning, and other applications to identify relationships between variables, detect multicollinearity, and perform feature selection.
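The extreme cases of the coefficient described above can be verified with a tiny from-scratch Pearson computation (a minimal sketch for illustration; the notebook itself relies on pandas' built-in .corr()):

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])

def pearson(a, b):
    # Covariance of a and b divided by the product of their standard deviations
    a, b = a - a.mean(), b - b.mean()
    return (a * b).sum() / np.sqrt((a ** 2).sum() * (b ** 2).sum())

print(pearson(x, 2 * x + 1))  # 1.0  (perfect positive linear relationship)
print(pearson(x, -3 * x))     # -1.0 (perfect negative linear relationship)
```

Any linear transformation with a positive slope yields exactly +1, and with a negative slope exactly -1, which is why the heatmap diagonal (each variable against itself) is always 1.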

In [15]:
cols = ['NPHI', 'RHOB', 'GR', 'RT', 'PEF', 'CALI', 'DT']

corr_matrix = df_noout[cols].corr()

sns.set_palette("pastel")

sns.heatmap(corr_matrix, annot=True, cmap='coolwarm')
Out[15]:
<Axes: >
In [16]:
cols = ['NPHI', 'RHOB', 'GR', 'RT', 'PEF', 'CALI', 'DT']

corr_matrix = df_noout[cols].corr()

sns.set_palette("pastel")

mask = np.triu(np.ones_like(corr_matrix, dtype=bool))

plt.figure(figsize=(10, 8))
sns.heatmap(corr_matrix, annot=True, mask=mask, cmap='coolwarm', square=True, linewidths=0.5, cbar_kws={"shrink": 0.8})
plt.title("Correlation Matrix (Lower Triangle Only)")
plt.show()

Step 3 - data normalization¶

Data normalization, also known as feature scaling, is the process of transforming data into a common scale or range in order to facilitate data analysis and improve the performance of machine learning algorithms. Normalization is important because many machine learning algorithms are sensitive to the scale and distribution of input features, and may perform poorly or inaccurately if the features are not on a similar scale.

Normalization involves rescaling the features of a dataset to have a mean of 0 and a standard deviation of 1, or scaling the features to a range between 0 and 1. The normalization method used depends on the specific data and the requirements of the analysis or algorithm being used. Common methods of normalization include Min-Max scaling, Z-score normalization, and Log transformation.

Min-Max scaling involves scaling the features to a range between 0 and 1, where the minimum value of the feature is transformed to 0 and the maximum value is transformed to 1. Z-score normalization involves transforming the features so that they have a mean of 0 and a standard deviation of 1, which can be accomplished by subtracting the mean from each value and then dividing by the standard deviation. Log transformation is another normalization technique used for data that is highly skewed or has a wide range of values, and involves applying a logarithmic function to the data to transform it into a more normal distribution.

Overall, data normalization is an important preprocessing step in data analysis and machine learning, as it can help to improve the accuracy and performance of models and algorithms, reduce overfitting, and ensure that features are on a similar scale.
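The three methods described above can be sketched in a few lines (a minimal illustration on a toy series; the log transform assumes strictly positive values):

```python
import numpy as np
import pandas as pd

x = pd.Series([2.0, 4.0, 6.0, 8.0, 10.0])

# Min-Max scaling: the minimum maps to 0 and the maximum to 1
minmax = (x - x.min()) / (x.max() - x.min())

# Z-score normalization: resulting mean is 0, standard deviation is 1
zscore = (x - x.mean()) / x.std(ddof=0)

# Log transformation: compresses a wide or skewed range (values must be > 0)
logged = np.log10(x)

print(minmax.tolist())  # [0.0, 0.25, 0.5, 0.75, 1.0]
```

In practice scikit-learn provides MinMaxScaler and StandardScaler for the first two; the notebook below uses PowerTransformer, which generalizes the log-style transform.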

3.1 - Transform the Resistivity Log to log scale¶

Resistivity data in well logs is typically measured in ohm-meters, and the values can span several orders of magnitude, making it difficult to visualize and analyze the data directly. To address this issue, resistivity data is often transformed using a logarithmic scale, which compresses the data into a more manageable range.

The logarithmic scale is a nonlinear scale that compresses large values into a smaller range, while expanding small values. This allows for a more accurate visualization of the data and makes it easier to identify patterns and trends. In particular, the use of logarithmic scales is useful for resistivity data because the range of resistivity values encountered in well logs can be very large, spanning several orders of magnitude.

In [17]:
df_noout = df_noout.copy()
df_noout['RT'] = np.log10(df_noout['RT'])
In [18]:
df_noout['RT']

df_noout.columns
Out[18]:
Index(['NPHI', 'RHOB', 'GR', 'RT', 'PEF', 'CALI', 'DT'], dtype='object')

3.2 - Transform data with skewed distribution¶

The power transform with the Yeo-Johnson method is a data transformation technique used to normalize a dataset that has a skewed distribution. It is a variant of the Box-Cox transformation, which normalizes skewed data but requires strictly positive values.

The Yeo-Johnson method is a modified version of the Box-Cox transformation that can be applied to both positively and negatively skewed data, as well as data that includes zero or negative values. It works by applying a power transformation to the data that varies based on the value of a lambda parameter, which is estimated from the data.

The Yeo-Johnson method is implemented in the PowerTransformer class of the scikit-learn library in Python. It can be used to transform a pandas DataFrame or numpy array to have a more normal distribution, which can be useful for machine learning models that assume a normal distribution of the data.

In [19]:
from IPython.display import Image

Image(url="https://i.postimg.cc/sDRksm7T/2.png")
Out[19]:
In [20]:
numeric_features = ['NPHI', 'RHOB', 'GR', 'RT', 'PEF', 'CALI', 'DT']

numeric_transformer = Pipeline(steps=[
    ('scaler', PowerTransformer(method='yeo-johnson', standardize=True))
])

preprocessor = ColumnTransformer(transformers=[
    ('num', numeric_transformer, numeric_features)
])

transformed_data = preprocessor.fit_transform(df_noout[numeric_features])

transformed_data = pd.DataFrame(transformed_data, columns=numeric_features)
transformed_data['DEPTH'] = df_noout.reset_index()['DEPTH']
transformed_data = transformed_data.set_index('DEPTH')

Draw a pairplot using the transformed data¶

In [21]:
plot_pairplot(transformed_data)

Draw boxplots using the transformed data¶

In [22]:
plot_boxplots(transformed_data)

Step 4 - final outlier removal using ML¶

The Isolation Forest algorithm is a type of computer program that helps find "outliers" in data. Outliers are data points that are very different from all the other data points. For example, imagine you have a list of test scores from your class, and one student got a score that is much higher or lower than everyone else. That student's score would be an outlier.

The Isolation Forest algorithm works by putting each data point in a "tree". Each tree has branches that divide the data into smaller and smaller groups. The algorithm keeps dividing the data until each point is in its own group, or until a certain number of groups have been made. This is like playing a game of "guess who" where you try to guess a character by asking yes-or-no questions, and keep dividing the characters into smaller groups until you know who the character is.

Once the trees are built, the algorithm looks at how many splits it took to isolate each point. Points that end up alone after only a few splits are easy to isolate and are flagged as outliers, while points that need many splits before they are separated, because they sit close to many other points, are not.

The Isolation Forest algorithm can be useful for finding outliers in data, which can be helpful in many different fields like finance, healthcare, and more.

In [23]:
from IPython.display import Image

Image(url="https://miro.medium.com/v2/resize:fit:1100/format:webp/1*D78QLbcwXesymhquuofnOg.png")
Out[23]:
In [24]:
numeric_cols = ['NPHI', 'RHOB', 'GR', 'RT', 'PEF', 'CALI', 'DT']

isos = {}
for col in numeric_cols:
    iso = IsolationForest(n_estimators=100, max_samples='auto', contamination=float(0.05), random_state=RANDOM_STATE)
    iso.fit(transformed_data[col].values.reshape(-1, 1))
    isos[col] = iso
In [25]:
# Replace the outliers with the mean of neighboring values for each column in a single loop
for col in numeric_cols:
    outliers = isos[col].predict(transformed_data[col].values.reshape(-1, 1)) == -1
    values = transformed_data[col].values
    for i in range(len(values)):
        if outliers[i]:
            # Replace outlier with the mean of neighboring values
            # (at the edges, fall back to the single available neighbor)
            if i == 0:
                values[i] = values[i + 1]
            elif i == len(values) - 1:
                values[i] = values[i - 1]
            else:
                values[i] = (values[i - 1] + values[i + 1]) / 2
    transformed_data[col] = values
In [26]:
# Fill any remaining NaN values with interpolated values
transformed_data[numeric_cols] = transformed_data[numeric_cols].interpolate()
In [27]:
# Print the resulting DataFrame without outliers
print(transformed_data)
            NPHI      RHOB        GR        RT       PEF          CALI  \
DEPTH                                                                    
2809.9  0.205930 -0.745286 -1.564246 -0.607747  1.055451  8.187895e-16   
2810.0  0.167616 -0.721726 -1.581017 -0.649998  1.047754 -3.830269e-15   
2810.1  0.153634 -0.647008 -1.563896 -0.712988  1.099386 -1.471046e-15   
2810.2  0.082195 -0.537420 -1.674785 -0.786870  1.186874 -3.830269e-15   
2810.3  0.018157 -0.425602 -1.859007 -0.852242  1.277809 -3.830269e-15   
...          ...       ...       ...       ...       ...           ...   
3599.0 -0.159180 -0.418538  0.391791 -1.053360 -1.395774  7.965850e-15   
3599.1 -0.171897 -0.295911  0.417180 -0.946528 -1.216829  5.204170e-15   
3599.2 -0.100195 -0.094388  0.426257 -0.871470 -0.979093  8.604228e-16   
3599.9  0.109486 -0.344918  0.477658 -0.555989 -1.447658 -3.830269e-15   
3600.0  0.270070 -0.399279  0.522775 -0.596548 -1.507675 -3.816392e-15   

              DT  
DEPTH             
2809.9 -0.211915  
2810.0 -0.217361  
2810.1 -0.228633  
2810.2 -0.239972  
2810.3 -0.250432  
...          ...  
3599.0 -0.013084  
3599.1 -0.061596  
3599.2 -0.108796  
3599.9 -0.114460  
3600.0 -0.043265  

[6618 rows x 7 columns]
In [28]:
plot_boxplots(transformed_data)

Step 5 - Facies analysis¶

Facies analysis from well logs is an important task in petroleum geology. It involves identifying and classifying the different rock types, or facies, encountered in a well. Traditionally, this has been done by geologists through visual inspection of the well logs. However, with the rise of machine learning techniques, it has become possible to automate this process.

One common approach to facies analysis is to use unsupervised clustering algorithms, such as K-means or hierarchical clustering, to group similar sections of the well log together. The goal is to identify natural groupings, or clusters, of log responses that correspond to different facies. Once the clusters have been identified, the geologist can assign each cluster to a specific facies based on their knowledge of the local geology.

To perform facies analysis using machine learning clustering, several well logs are typically collected and pre-processed. The logs are usually normalized and scaled to make them comparable, and any missing data is imputed or removed. Features, such as gamma-ray, resistivity, and porosity, are extracted from the logs and used as inputs to the clustering algorithm.

Once the features have been extracted, a clustering algorithm is applied to group the sections of the well log that have similar responses. The algorithm assigns each section to a cluster, which can be visualized on a plot. The plot can help identify any clear patterns or trends in the data and help the geologist make sense of the clustering results.

The next step is to assign each cluster to a specific facies. This is done by comparing the clustering results to the geological knowledge of the area. The geologist may use other data sources, such as core samples or outcrop data, to help identify the facies associated with each cluster.

Once the clusters have been assigned to facies, the results can be used to create a facies log. This log can be used to interpret the geology of the well and to help identify potential hydrocarbon reservoirs or other geological features.

In [29]:
Image(url="https://interviewquery-cms-images.s3-us-west-1.amazonaws.com/ac5da238-25ab-48ef-839a-407a7b76a167.jpg")
Out[29]:

K-means¶

is a type of unsupervised learning algorithm used for clustering data points into different groups or clusters based on the similarity of the data points. The algorithm works by iteratively assigning each data point to the nearest cluster center, and then computing the new cluster centers based on the mean of the assigned points. This process is repeated until the cluster centers no longer move significantly.

The algorithm requires the user to specify the number of clusters beforehand. The objective of the algorithm is to minimize the sum of squared distances between each data point and its assigned cluster center, which is also known as the Within-Cluster-Sum-of-Squares (WCSS) metric.

The k-means algorithm can be divided into three main steps:

  1. Initialization: The algorithm randomly selects k data points to act as initial cluster centers.

  2. Assignment: Each data point is assigned to the nearest cluster center based on the Euclidean distance between the point and the center.

  3. Update: The mean of the data points assigned to each cluster is computed, and this value is used as the new cluster center.

These three steps are repeated iteratively until the cluster centers no longer move significantly, or a maximum number of iterations is reached. The final output of the algorithm is the cluster assignments of each data point.

One of the main advantages of the k-means algorithm is its simplicity and scalability, which allows it to handle large datasets efficiently. However, the algorithm is sensitive to the initial placement of the cluster centers, and may converge to suboptimal solutions. In addition, the algorithm is not effective when dealing with non-linearly separable data. Nonetheless, k-means is widely used in various applications such as image segmentation, market segmentation, and anomaly detection.
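The three steps above can be sketched from scratch with NumPy on a toy dataset (a minimal illustration only; the notebook itself relies on scikit-learn's KMeans, and the initial centers here are hard-coded to one point per blob rather than chosen at random, so the sketch is reproducible):

```python
import numpy as np

rng = np.random.default_rng(0)
# Two well-separated 2-D blobs around (0, 0) and (10, 10)
X = np.vstack([rng.normal(0, 0.5, (50, 2)), rng.normal(10, 0.5, (50, 2))])

k = 2
# Step 1 - Initialization: plain k-means picks k random points as centers;
# here we take the first point of each blob for a deterministic sketch
centers = X[[0, 50]]

for _ in range(100):
    # Step 2 - Assignment: each point goes to the nearest center (Euclidean distance)
    labels = np.argmin(np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2), axis=1)
    # Step 3 - Update: each center moves to the mean of its assigned points
    new_centers = np.array([X[labels == j].mean(axis=0) for j in range(k)])
    if np.allclose(new_centers, centers):  # stop when centers no longer move
        break
    centers = new_centers

print(np.round(np.sort(centers, axis=0), 1))  # roughly [[0, 0], [10, 10]]
```

With the hard-coded initialization the loop converges in a couple of iterations; with random initialization (as in step 1 of the description) the result can depend on the starting centers, which is exactly the sensitivity mentioned above.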

In [30]:
Image(
    url="https://ds055uzetaobb.cloudfront.net/brioche/uploads/y4KGN92h7r-screen-shot-2016-05-05-at-43007-pm.png?width=2000")
Out[30]:
No description has been provided for this image
In [31]:
cols = ['NPHI', 'RHOB', 'GR', 'RT', 'PEF', 'CALI', 'DT']
X = transformed_data[cols]
In [32]:
model = KMeans(n_clusters=4, random_state=RANDOM_STATE)
In [33]:
y = model.fit_predict(X)
y
Out[33]:
array([1, 1, 1, ..., 0, 0, 0])
In [34]:
transformed_data['K10'] = y
transformed_data['K10_name'] = "Facies " + (transformed_data['K10'] + 1).astype('str')
transformed_data = transformed_data.reset_index()

transformed_data
Out[34]:
DEPTH NPHI RHOB GR RT PEF CALI DT K10 K10_name
0 2809.9 0.205930 -0.745286 -1.564246 -0.607747 1.055451 8.187895e-16 -0.211915 1 Facies 2
1 2810.0 0.167616 -0.721726 -1.581017 -0.649998 1.047754 -3.830269e-15 -0.217361 1 Facies 2
2 2810.1 0.153634 -0.647008 -1.563896 -0.712988 1.099386 -1.471046e-15 -0.228633 1 Facies 2
3 2810.2 0.082195 -0.537420 -1.674785 -0.786870 1.186874 -3.830269e-15 -0.239972 1 Facies 2
4 2810.3 0.018157 -0.425602 -1.859007 -0.852242 1.277809 -3.830269e-15 -0.250432 1 Facies 2
... ... ... ... ... ... ... ... ... ... ...
6613 3599.0 -0.159180 -0.418538 0.391791 -1.053360 -1.395774 7.965850e-15 -0.013084 0 Facies 1
6614 3599.1 -0.171897 -0.295911 0.417180 -0.946528 -1.216829 5.204170e-15 -0.061596 0 Facies 1
6615 3599.2 -0.100195 -0.094388 0.426257 -0.871470 -0.979093 8.604228e-16 -0.108796 0 Facies 1
6616 3599.9 0.109486 -0.344918 0.477658 -0.555989 -1.447658 -3.830269e-15 -0.114460 0 Facies 1
6617 3600.0 0.270070 -0.399279 0.522775 -0.596548 -1.507675 -3.816392e-15 -0.043265 0 Facies 1

6618 rows × 10 columns

Print transformed_data. What do you see?¶

In [35]:
K5_graph = transformed_data.copy()
K5_graph['x'] = 1
sns.set_style("white")
sns.set(font_scale=1.5)
plt.rcParams['xtick.major.size'] = 200
plt.rcParams['xtick.major.width'] = 40
plt.rcParams['xtick.bottom'] = True
plt.rcParams['ytick.left'] = True
plt.figure(figsize=(3, 12))
ax = sns.scatterplot(data=K5_graph, x='x', y='DEPTH', hue='K10_name', marker='s', s=50000, edgecolor='None',
                     palette='pastel')
plt.tick_params(bottom=False, labelbottom=False)
plt.yticks(K5_graph['DEPTH'][::500])
plt.legend([], [], frameon=False)
plt.ylabel("Depth [m]")
plt.xlabel('')
plt.title("KMeans \n n = 4")
ax.invert_yaxis()
plt.tight_layout()
plt.rcParams.update({})
No description has been provided for this image
In [36]:
def score(n_clusters, data, cols):
    """Fit KMeans with n_clusters and return common cluster-validity metrics."""
    model = KMeans(n_clusters=n_clusters, max_iter=300, random_state=RANDOM_STATE)
    X = data[cols]  # use the DataFrame passed in, not the global
    y = model.fit_predict(X)
    SSE = model.inertia_  # within-cluster sum of squares (WCSS)
    Silhouette = metrics.silhouette_score(X, y)
    CHS = metrics.calinski_harabasz_score(X, y)
    DBS = metrics.davies_bouldin_score(X, y)
    return {'SSE': SSE, 'Silhouette': Silhouette, 'Calinski_Harabasz': CHS, 'Davies_Bouldin': DBS, 'model': model}
In [37]:
df_cluster_scorer = pd.DataFrame()
df_cluster_scorer['n_clusters'] = list(range(2, 21))
df_cluster_scorer
Out[37]:
n_clusters
0 2
1 3
2 4
3 5
4 6
5 7
6 8
7 9
8 10
9 11
10 12
11 13
12 14
13 15
14 16
15 17
16 18
17 19
18 20
In [38]:
df_cluster_scorer['SSE'], df_cluster_scorer['Silhouette'], \
    df_cluster_scorer['Calinski_Harabasz'], df_cluster_scorer['Davies_Bouldin'], \
    df_cluster_scorer['model'] = zip(
    *df_cluster_scorer['n_clusters'].map(lambda row: score(row, transformed_data, cols).values()))

df_cluster_scorer
Out[38]:
n_clusters SSE Silhouette Calinski_Harabasz Davies_Bouldin model
0 2 22232.880540 0.398190 5067.494068 1.011611 KMeans(n_clusters=2, random_state=417564)
1 3 17696.018759 0.357794 4030.780599 0.965663 KMeans(n_clusters=3, random_state=417564)
2 4 12432.712985 0.378656 4757.543098 0.940378 KMeans(n_clusters=4, random_state=417564)
3 5 9574.122176 0.396172 5126.432794 0.827832 KMeans(n_clusters=5, random_state=417564)
4 6 8581.762798 0.384683 4727.608788 0.971837 KMeans(n_clusters=6, random_state=417564)
5 7 7531.187358 0.355745 4642.301652 0.994722 KMeans(n_clusters=7, random_state=417564)
6 8 7109.848682 0.333390 4270.251125 1.051438 KMeans(random_state=417564)
7 9 6462.448353 0.350017 4192.890411 0.947022 KMeans(n_clusters=9, random_state=417564)
8 10 5454.446530 0.357063 4550.796223 0.877808 KMeans(n_clusters=10, random_state=417564)
9 11 5080.274323 0.336127 4445.370587 0.914773 KMeans(n_clusters=11, random_state=417564)
10 12 4763.607968 0.318390 4349.162127 0.984039 KMeans(n_clusters=12, random_state=417564)
11 13 4602.035936 0.314509 4145.401175 1.033802 KMeans(n_clusters=13, random_state=417564)
12 14 4207.526813 0.331770 4232.309641 1.009040 KMeans(n_clusters=14, random_state=417564)
13 15 4073.050285 0.324710 4074.726448 1.011548 KMeans(n_clusters=15, random_state=417564)
14 16 3755.497996 0.338240 4161.233490 0.967796 KMeans(n_clusters=16, random_state=417564)
15 17 3648.448144 0.327315 4027.130842 0.991153 KMeans(n_clusters=17, random_state=417564)
16 18 3431.585761 0.323768 4053.679094 1.005635 KMeans(n_clusters=18, random_state=417564)
17 19 3217.576053 0.329379 4106.878915 0.971526 KMeans(n_clusters=19, random_state=417564)
18 20 3060.838660 0.323180 4107.131933 1.006703 KMeans(n_clusters=20, random_state=417564)
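Before eyeballing the plots, each metric's preferred n can also be read straight off the table. The helper below is a sketch (the function name `best_n_clusters` is mine, not from the notebook); SSE is left out because it decreases monotonically with n and is judged by its elbow rather than an argmin:

```python
import pandas as pd

def best_n_clusters(scores: pd.DataFrame) -> dict:
    """Return each metric's preferred n_clusters from a score table
    with the columns built above (SSE excluded: it always falls as n grows)."""
    return {
        'Silhouette (max)': int(scores.loc[scores['Silhouette'].idxmax(), 'n_clusters']),
        'Calinski_Harabasz (max)': int(scores.loc[scores['Calinski_Harabasz'].idxmax(), 'n_clusters']),
        'Davies_Bouldin (min)': int(scores.loc[scores['Davies_Bouldin'].idxmin(), 'n_clusters']),
    }
```

Applied to the table above, Calinski-Harabasz (max at 5126.43) and Davies-Bouldin (min at 0.8278) both point to n = 5, while Silhouette narrowly prefers n = 2.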
In [39]:
plt.rcParams.update({})
sns.set()
sns.set_style("whitegrid")


# Plot each graph on its own axis
df_cluster_scorer.plot.line(x='n_clusters', y='SSE', title="SSE - look for the elbow")
df_cluster_scorer.plot.line(x='n_clusters', y='Silhouette', title="Silhouette - values closer to 1 are better")
df_cluster_scorer.plot.line(x='n_clusters', y='Calinski_Harabasz',
                            title="Calinski-Harabasz - higher values are better")
df_cluster_scorer.plot.line(x='n_clusters', y='Davies_Bouldin',
                            title="Davies-Bouldin - values closer to 0 are better")

#plt.tight_layout()
plt.show()
No description has been provided for this image
No description has been provided for this image
No description has been provided for this image
No description has been provided for this image

Choose optimum number of clusters and predict!¶

I selected the Davies-Bouldin index to determine the optimal number of clusters, mainly because its plot is the easiest to read and reason about.

After analysing that chart and comparing it with the others, I decided that 5 is the optimal number of clusters in this case.

The remaining charts support this choice: the Calinski-Harabasz score also peaks at n = 5, and the Silhouette score at n = 5 is nearly as high as its maximum at n = 2.

In [40]:
cols = ['NPHI', 'RHOB', 'GR', 'RT', 'PEF', 'CALI', 'DT']
X = transformed_data[cols]

# Fit KMeans with 5 clusters
kmeans = KMeans(n_clusters=5, random_state=RANDOM_STATE)
transformed_data['Facies'] = kmeans.fit_predict(X)

print(transformed_data['Facies'].value_counts().sort_index())
Facies
0    1057
1    1258
2     616
3    2376
4    1311
Name: count, dtype: int64
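A quick way to start attaching geological meaning to the clusters, before comparing them against published stratigraphy, is to look at each facies' average log response. The helper name `facies_log_profile` is mine; it assumes a DataFrame shaped like transformed_data above, with a 'Facies' column:

```python
import pandas as pd

def facies_log_profile(df: pd.DataFrame, log_cols: list) -> pd.DataFrame:
    """Mean value of each log curve per predicted facies label."""
    return df.groupby('Facies')[log_cols].mean()
```

Since the logs here are power-transformed, the values are relative rather than in physical units, but a facies with a distinctly higher mean GR, for instance, is likely the more shale-prone one.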

Compare your results with the stratigraphy¶

https://www.researchgate.net/publication/332441275_Estimation_of_Pore_Pressure_and_Fracture_Gradient_in_Volve_Field_Norwegian_North_Sea

(no open-access full text is available for this publication)